Approximate String Matching Techniques for Effective CLIR Among Indian Languages

Authors

  • Ranbeer Makin
  • Nikita Pandey
  • Prasad Pingali
  • Vasudeva Varma
Abstract

The vocabulary commonly used in Indian-language documents on the web contains many words of Sanskrit, Persian, or English origin. However, such words may be written in different scripts, with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates shared among Indian languages, which is higher than the number shared between an Indian language and a non-Indian language. We present an approach to identify cognates and use them to improve dictionary-based CLIR when the query and the documents are in two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries, and report the improvement due to cognate recognition and translation.
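To make the idea of cognate identification by approximate string matching concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the matching technique, so this illustration assumes a normalized Levenshtein similarity. It also assumes that Telugu text can be roughly aligned with Devanagari by shifting code points between the two Unicode blocks, which follow a largely parallel layout; that shortcut is a heuristic, not a real transliteration. The function names and the 0.75 threshold are hypothetical.

```python
# Illustrative sketch only; names, threshold, and the script-mapping
# heuristic are assumptions, not taken from the paper.

TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F
BLOCK_OFFSET = 0x0C00 - 0x0900  # distance between the Telugu and Devanagari blocks


def telugu_to_devanagari(text: str) -> str:
    """Shift Telugu code points into the Devanagari block (rough heuristic)."""
    return "".join(
        chr(ord(ch) - BLOCK_OFFSET) if TELUGU_START <= ord(ch) <= TELUGU_END else ch
        for ch in text
    )


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def find_cognates(query_term: str, vocabulary: list[str], threshold: float = 0.75):
    """Return (word, score) pairs from a Hindi vocabulary that resemble a Telugu term."""
    mapped = telugu_to_devanagari(query_term)
    scored = [(word, similarity(mapped, word)) for word in vocabulary]
    return sorted((p for p in scored if p[1] >= threshold), key=lambda p: -p[1])
```

For instance, calling `find_cognates` with the Telugu spelling of "vidya" and a Hindi vocabulary containing "vidyā" would keep that word as a candidate, because the two strings differ by a single character after the script shift. In a dictionary-based CLIR setting, such candidates could supplement the translations of query terms that the bilingual dictionary does not cover.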


Related resources

Dictionary-independent translation in CLIR between closely related languages

This paper presents results from a study in which fuzzy string matching techniques were used as the sole query translation technique in Cross Language Information Retrieval (CLIR) between the closely related languages Swedish and Norwegian. It is a novel research idea to apply only fuzzy string matching techniques in query translation. Closely related languages share a number of words that are cr...


Handling OOV Words in Indian-language - English CLIR

Because of the lack of resources, cross-lingual information retrieval is a difficult task for many Indian languages. Google Translate provides an easy way of translating from Indian languages to English, but due to lexicon limitations most out-of-vocabulary words get transliterated letter by letter along with their suffix, resulting in an unusually long string. The resulting string often do...


Cross-lingual information access in indigenous languages: a case study in Zulu

We review the applicability of dictionary-based Cross-Language Information Retrieval (CLIR) from Zulu to English. Due to the lack of electronic resources, and in particular of tools for morphological analysis, novel approaches had to be found to deal with the processes of CLIR. Approximate string matching, combined with a monolingual Zulu word list, was used. The results suggest that the disparate v...


Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proxi...


Information access in indigenous languages: a case study in Zulu

This study focuses on the intellectual accessibility of information in indigenous languages, using Zulu, one of the main indigenous languages in South Africa, as a test case. Both Cross-Lingual Information Retrieval (CLIR) and metadata are discussed as possible means of facilitating access, and a bilateral approach combining these two methods is proposed. Popular CLIR approaches and their resour...




Publication date: 2007